Welcome to your first project! We will be exploring possible connections between water usage, geography, and income in California. The water data for this project was procured from the California State Water Resources Control Board and curated by the Pacific Institute. The map data includes US topography, California counties, and ZIP codes.
The dataset on income comes from the IRS (documented here). We have identified some interesting columns in the dataset, but a full description of all the columns (and a definition of the population in the dataset and some interesting anonymization procedures they used) is available here.
Your project will be due Friday, March 3 at 11PM. If you like, you may work with one partner. Outside of your partner, your work should be your own work. This project has some hard pieces! It is okay if you do not solve every task, but you should try hard on every one and document your efforts. I and the TAs (Zoe and Lesley) are very happy to help with hints and nudges as well as looking for bugs. If you start early, you should have plenty of time to ask questions.
# Run this cell, but please don't change it.
import numpy as np
import math
from datascience import *
# These lines set up the plotting functionality and formatting.
import matplotlib
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
First, load the data. Loading may take some time. You will need to upload the two files from Moodle to the same location on your Jupyter hub that you have Project 1 located. First download them from Moodle and then use the Upload button on the Jupyter hub to upload them.
# Run this cell, but please don't change it.
districts = Map.read_geojson('water_districts.geojson')
zips = Map.read_geojson('ca_zips.geojson.gz')
usage_raw = Table.read_table('~/DS_113_S23/Projects/Project_1/water_usage.csv', dtype={'pwsid': str})
income_raw = Table.read_table('~/DS_113_S23/Projects/Project_1/ca_income_by_zip.csv', dtype={'ZIP': str}).drop('STATEFIPS', 'STATE', 'agi_stub')
wd_vs_zip = Table.read_table('~/DS_113_S23/Projects/Project_1/wd_vs_zip.csv', dtype={'PWSID': str, 'ZIP': str}).set_format(make_array(2, 3), PercentFormatter)
The districts and zips data sets are Map objects. Documentation on mapping in the datascience package can be found at data8.org/datascience/maps.html. To view a map of California's water districts, run the cell below. Click on a district to see its description.
districts.format(width=800, height=600)
A Map is a collection of regions and other features such as points and markers, each of which has a string id and various properties. You can view the features of the districts map as a table using Table.from_records.
district_table = Table.from_records(districts.features)
district_table.show(3)
| PWSID | feature | id | popupContent |
|---|---|---|---|
| 0110001 | Make this Notebook Trusted to load map: File -> Trust Notebook | 0 | Alameda County Water District |
| 0110003 | Make this Notebook Trusted to load map: File -> Trust Notebook | 1 | California Water Service Company Livermore |
| 0110005 | Make this Notebook Trusted to load map: File -> Trust Notebook | 2 | East Bay Municipal Utilities District |
... (407 rows omitted)
To display a Map containing only two features from the district_table, call Map on an array containing those two features from the feature column.
Question 1.1. Draw a map of the Alameda County Water District (row 0) and the East Bay Municipal Utilities District (row 2).
# Fill in the next line to make an array containing the two elements from district_table
# that you want out of the feature column.
alameda_and_east_bay = make_array(district_table["feature"][0],district_table["feature"][2])
Map(alameda_and_east_bay, height=300, width=300)
Hint: If scrolling becomes slow on your computer, you can clear maps for the cells above by running Cell > All Output > Clear from the Cell menu.
Let's look at the income_raw table, which comes from the IRS. We're going to link this information about incomes to our information about water. First, we need to investigate the income data and get it into a more usable form.
income_raw
| ZIP | N1 | MARS1 | MARS2 | MARS4 | PREP | N2 | NUMDEP | A00100 | N02650 | A02650 | N00200 | A00200 | N00300 | A00300 | N00600 | A00600 | N00650 | A00650 | N00700 | A00700 | N00900 | A00900 | N01000 | A01000 | N01400 | A01400 | N01700 | A01700 | SCHF | N02300 | A02300 | N02500 | A02500 | N26270 | A26270 | N02900 | A02900 | N03220 | A03220 | N03300 | A03300 | N03270 | A03270 | N03150 | A03150 | N03210 | A03210 | N03230 | A03230 | N03240 | A03240 | N04470 | A04470 | A00101 | N18425 | A18425 | N18450 | A18450 | N18500 | A18500 | N18300 | A18300 | N19300 | A19300 | N19700 | A19700 | N04800 | A04800 | N05800 | A05800 | N09600 | A09600 | N07100 | A07100 | N07300 | A07300 | N07180 | A07180 | N07230 | A07230 | N07240 | A07240 | N07220 | A07220 | N07260 | A07260 | N09400 | A09400 | N10600 | A10600 | N59660 | A59660 | N59720 | A59720 | N11070 | A11070 | N10960 | A10960 | N06500 | A06500 | N10300 | A10300 | N85330 | A85330 | N85300 | A85300 | N11901 | A11901 | N11902 | A11902 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 90001 | 13100 | 6900 | 1890 | 4270 | 10740 | 29670 | 15200 | 181693 | 13100 | 184344 | 10220 | 142287 | 640 | 96 | 50 | 69 | 30 | 20 | 180 | 55 | 3490 | 31751 | 70 | -17 | 60 | 382 | 380 | 3001 | 0 | 950 | 4493 | 80 | 123 | 30 | 105 | 3510 | 2651 | 20 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 150 | 119 | 60 | 98 | 0 | 0 | 510 | 7212 | 8498 | 170 | 95 | 270 | 154 | 340 | 952 | 500 | 1424 | 340 | 3168 | 240 | 363 | 4200 | 20798 | 4190 | 2176 | 0 | 0 | 1800 | 623 | 0 | 0 | 70 | 28 | 520 | 248 | 240 | 39 | 1020 | 304 | 0 | 0 | 3290 | 4659 | 12180 | 37759 | 6320 | 20531 | 5880 | 16995 | 6500 | 8805 | 1230 | 980 | 2680 | 1554 | 5800 | 6197 | 0 | 0 | 0 | 0 | 1220 | 856 | 11480 | 32934 |
| 90001 | 5900 | 1700 | 1970 | 2210 | 4960 | 17550 | 9690 | 203628 | 5900 | 204512 | 5610 | 188556 | 830 | 68 | 60 | 32 | 70 | 19 | 660 | 351 | 540 | 5917 | 60 | 111 | 50 | 489 | 310 | 4450 | 0 | 460 | 2522 | 210 | 947 | 40 | 831 | 700 | 885 | 50 | 12 | 0 | 0 | 0 | 0 | 50 | 152 | 200 | 154 | 50 | 81 | 0 | 0 | 1260 | 20442 | 46800 | 800 | 1023 | 430 | 505 | 810 | 2383 | 1250 | 4605 | 820 | 7172 | 870 | 1620 | 5300 | 74352 | 5290 | 8710 | 0 | 0 | 3380 | 2947 | 0 | 0 | 370 | 249 | 620 | 518 | 630 | 114 | 2520 | 2047 | 20 | 6 | 410 | 982 | 5820 | 23531 | 2160 | 4562 | 1890 | 3948 | 2520 | 4160 | 600 | 441 | 3180 | 5762 | 3460 | 6754 | 0 | 0 | 0 | 0 | 530 | 797 | 5360 | 17656 |
| 90001 | 1480 | 330 | 760 | 390 | 1240 | 4710 | 2470 | 89065 | 1480 | 89344 | 1440 | 82579 | 400 | 32 | 20 | 12 | 0 | 0 | 460 | 477 | 130 | 1351 | 40 | 1 | 40 | 660 | 140 | 2599 | 0 | 150 | 892 | 60 | 670 | 0 | 0 | 210 | 279 | 50 | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 90 | 76 | 0 | 0 | 0 | 0 | 680 | 13264 | 41920 | 540 | 1329 | 120 | 294 | 500 | 1453 | 680 | 3591 | 490 | 4248 | 560 | 1296 | 1470 | 49298 | 1470 | 6618 | 0 | 0 | 900 | 1488 | 0 | 0 | 120 | 90 | 180 | 198 | 80 | 11 | 750 | 1163 | 20 | 7 | 80 | 232 | 1470 | 8950 | 0 | 0 | 0 | 0 | 190 | 256 | 170 | 132 | 1290 | 5131 | 1310 | 5417 | 0 | 0 | 0 | 0 | 230 | 520 | 1250 | 4050 |
| 90001 | 330 | 50 | 210 | 70 | 290 | 1100 | 560 | 28395 | 330 | 28555 | 320 | 25271 | 130 | 10 | 20 | 24 | 30 | 29 | 160 | 235 | 60 | 1084 | 0 | 0 | 0 | 0 | 40 | 1044 | 0 | 40 | 230 | 30 | 598 | 0 | 0 | 60 | 160 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30 | 21 | 0 | 0 | 0 | 0 | 210 | 4846 | 18256 | 190 | 734 | 30 | 67 | 160 | 477 | 210 | 1437 | 160 | 1488 | 180 | 514 | 330 | 17918 | 330 | 2627 | 0 | 0 | 190 | 380 | 0 | 0 | 50 | 37 | 60 | 66 | 0 | 0 | 180 | 309 | 0 | 0 | 40 | 191 | 330 | 3127 | 0 | 0 | 0 | 0 | 0 | 0 | 50 | 41 | 330 | 2247 | 330 | 2398 | 0 | 0 | 0 | 0 | 70 | 289 | 270 | 1014 |
| 90001 | 160 | 30 | 100 | 40 | 130 | 510 | 250 | 24676 | 160 | 25017 | 150 | 17851 | 70 | 35 | 20 | 71 | 0 | 0 | 70 | 140 | 0 | 0 | 20 | 88 | 0 | 0 | 20 | 477 | 0 | 0 | 0 | 0 | 0 | 20 | 2967 | 40 | 186 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 130 | 4063 | 20530 | 120 | 1362 | 0 | 0 | 100 | 447 | 130 | 1980 | 90 | 1086 | 100 | 429 | 160 | 18375 | 160 | 3889 | 30 | 124 | 80 | 145 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 60 | 74 | 0 | 0 | 0 | 0 | 160 | 3868 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 160 | 3723 | 160 | 3897 | 0 | 0 | 0 | 0 | 60 | 702 | 100 | 474 |
| 90001 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 90002 | 12150 | 6330 | 1460 | 4330 | 9580 | 27240 | 14070 | 167261 | 12150 | 170095 | 9440 | 126482 | 610 | 73 | 50 | 19 | 40 | 9 | 150 | 49 | 3610 | 34019 | 50 | 9 | 50 | 196 | 380 | 3133 | 0 | 900 | 4222 | 70 | 84 | 30 | 116 | 3630 | 2834 | 30 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 170 | 133 | 50 | 103 | 0 | 0 | 540 | 7237 | 8945 | 170 | 124 | 300 | 175 | 360 | 987 | 530 | 1621 | 390 | 3444 | 260 | 367 | 3750 | 18438 | 3750 | 1927 | 0 | 0 | 1800 | 618 | 0 | 0 | 100 | 35 | 540 | 242 | 240 | 39 | 980 | 300 | 0 | 0 | 3400 | 4983 | 11540 | 39973 | 6940 | 23487 | 6550 | 19268 | 6380 | 8439 | 1520 | 1330 | 2210 | 1310 | 5520 | 6298 | 0 | 0 | 0 | 0 | 930 | 642 | 10960 | 34691 |
| 90002 | 5030 | 1510 | 1490 | 1980 | 4120 | 14410 | 7890 | 173280 | 5030 | 174335 | 4760 | 159099 | 790 | 52 | 60 | 41 | 40 | 17 | 620 | 351 | 560 | 6015 | 70 | -10 | 40 | 237 | 320 | 5075 | 0 | 390 | 2070 | 180 | 831 | 0 | 0 | 710 | 1055 | 50 | 11 | 0 | 0 | 0 | 0 | 40 | 107 | 190 | 179 | 50 | 99 | 0 | 0 | 1330 | 21281 | 49334 | 840 | 1008 | 440 | 489 | 900 | 2430 | 1320 | 4684 | 920 | 7738 | 960 | 2253 | 4550 | 63887 | 4540 | 7503 | 0 | 0 | 2840 | 2550 | 0 | 0 | 330 | 234 | 590 | 523 | 500 | 89 | 2050 | 1688 | 30 | 9 | 420 | 1003 | 4960 | 20637 | 2020 | 4371 | 1750 | 3736 | 2140 | 3459 | 560 | 455 | 2740 | 4953 | 3070 | 5998 | 0 | 0 | 0 | 0 | 420 | 782 | 4590 | 15527 |
| 90002 | 1320 | 300 | 600 | 400 | 1060 | 4090 | 2180 | 78559 | 1320 | 78871 | 1270 | 72098 | 390 | 62 | 40 | 9 | 50 | 4 | 460 | 462 | 100 | 596 | 30 | 65 | 40 | 552 | 170 | 3512 | 0 | 120 | 621 | 70 | 859 | 0 | 0 | 200 | 312 | 50 | 11 | 0 | 0 | 0 | 0 | 20 | 54 | 110 | 99 | 0 | 0 | 0 | 0 | 710 | 13730 | 42942 | 570 | 1316 | 130 | 266 | 540 | 1495 | 710 | 3603 | 530 | 4724 | 550 | 1480 | 1310 | 42787 | 1310 | 5729 | 20 | 24 | 780 | 1314 | 0 | 0 | 120 | 94 | 170 | 176 | 70 | 10 | 670 | 1023 | 20 | 6 | 60 | 118 | 1310 | 7846 | 0 | 0 | 0 | 0 | 190 | 272 | 150 | 115 | 1120 | 4415 | 1140 | 4550 | 0 | 0 | 0 | 0 | 180 | 359 | 1130 | 3645 |
| 90002 | 340 | 90 | 190 | 90 | 270 | 1060 | 530 | 28502 | 340 | 28558 | 320 | 25304 | 110 | 29 | 20 | 3 | 0 | 0 | 160 | 224 | 40 | 235 | 20 | 193 | 0 | 0 | 50 | 1538 | 0 | 40 | 218 | 40 | 570 | 0 | 0 | 40 | 56 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 230 | 5033 | 19189 | 200 | 780 | 30 | 73 | 180 | 545 | 230 | 1521 | 170 | 1601 | 200 | 685 | 340 | 18186 | 340 | 2749 | 0 | 0 | 190 | 365 | 0 | 0 | 60 | 40 | 50 | 65 | 0 | 0 | 170 | 282 | 0 | 0 | 20 | 55 | 340 | 3225 | 0 | 0 | 0 | 0 | 0 | 0 | 50 | 43 | 330 | 2384 | 330 | 2427 | 0 | 0 | 0 | 0 | 60 | 213 | 270 | 1009 |
... (8888 rows omitted)
Some observations:
'N02650' is the number of returns that included a total income amount, and 'A02650' is the total amount of total income (in thousands of dollars) from those returns.For the analysis we're about to do, we won't need to use the information about tax brackets. We will need to know the total income, total number of returns, and other totals from each ZIP code.
Question 2.1. Assign the name income_by_zipcode to a table with just one row per ZIP code. When you group according to ZIP code, the remaining columns should be summed. In other words, for any other column such as 'N02650', the value of 'N02650' in a row corresponding to ZIP code 90210 (for example) should be the sum of the values of 'N02650' in the 6 rows of income_raw corresponding to ZIP code 90210.
income_by_zipcode = income_raw.group("ZIP",np.sum)
income_by_zipcode
| ZIP | N1 sum | MARS1 sum | MARS2 sum | MARS4 sum | PREP sum | N2 sum | NUMDEP sum | A00100 sum | N02650 sum | A02650 sum | N00200 sum | A00200 sum | N00300 sum | A00300 sum | N00600 sum | A00600 sum | N00650 sum | A00650 sum | N00700 sum | A00700 sum | N00900 sum | A00900 sum | N01000 sum | A01000 sum | N01400 sum | A01400 sum | N01700 sum | A01700 sum | SCHF sum | N02300 sum | A02300 sum | N02500 sum | A02500 sum | N26270 sum | A26270 sum | N02900 sum | A02900 sum | N03220 sum | A03220 sum | N03300 sum | A03300 sum | N03270 sum | A03270 sum | N03150 sum | A03150 sum | N03210 sum | A03210 sum | N03230 sum | A03230 sum | N03240 sum | A03240 sum | N04470 sum | A04470 sum | A00101 sum | N18425 sum | A18425 sum | N18450 sum | A18450 sum | N18500 sum | A18500 sum | N18300 sum | A18300 sum | N19300 sum | A19300 sum | N19700 sum | A19700 sum | N04800 sum | A04800 sum | N05800 sum | A05800 sum | N09600 sum | A09600 sum | N07100 sum | A07100 sum | N07300 sum | A07300 sum | N07180 sum | A07180 sum | N07230 sum | A07230 sum | N07240 sum | A07240 sum | N07220 sum | A07220 sum | N07260 sum | A07260 sum | N09400 sum | A09400 sum | N10600 sum | A10600 sum | N59660 sum | A59660 sum | N59720 sum | A59720 sum | N11070 sum | A11070 sum | N10960 sum | A10960 sum | N06500 sum | A06500 sum | N10300 sum | A10300 sum | N85330 sum | A85330 sum | N85300 sum | A85300 sum | N11901 sum | A11901 sum | N11902 sum | A11902 sum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 90001 | 20970 | 9010 | 4930 | 6980 | 17360 | 53540 | 28170 | 527457 | 20970 | 531772 | 17740 | 456544 | 2070 | 241 | 170 | 208 | 130 | 68 | 1530 | 1258 | 4220 | 40103 | 190 | 183 | 150 | 1531 | 890 | 11571 | 0 | 1600 | 8137 | 380 | 2338 | 90 | 3903 | 4520 | 4161 | 120 | 32 | 0 | 0 | 0 | 0 | 50 | 152 | 470 | 370 | 110 | 179 | 0 | 0 | 2790 | 49827 | 136004 | 1820 | 4543 | 850 | 1020 | 1910 | 5712 | 2770 | 13037 | 1900 | 17162 | 1950 | 4222 | 11460 | 180741 | 11440 | 24020 | 30 | 124 | 6350 | 5583 | 0 | 0 | 610 | 404 | 1380 | 1030 | 950 | 164 | 4530 | 3897 | 40 | 13 | 3820 | 6064 | 19960 | 77235 | 8480 | 25093 | 7770 | 20943 | 9210 | 13221 | 2050 | 1594 | 7640 | 18417 | 11060 | 24663 | 0 | 0 | 0 | 0 | 2110 | 3164 | 18460 | 56128 |
| 90002 | 18960 | 8230 | 3830 | 6800 | 15120 | 47200 | 24850 | 462823 | 18960 | 467128 | 15910 | 396088 | 1960 | 227 | 170 | 72 | 130 | 30 | 1450 | 1196 | 4310 | 40865 | 170 | 257 | 130 | 985 | 950 | 14167 | 0 | 1450 | 7131 | 360 | 2344 | 30 | 116 | 4610 | 4305 | 130 | 29 | 0 | 0 | 0 | 0 | 60 | 161 | 470 | 411 | 100 | 202 | 0 | 0 | 2910 | 50033 | 133338 | 1880 | 3809 | 900 | 1003 | 2070 | 5698 | 2890 | 12370 | 2100 | 18397 | 2060 | 5197 | 10070 | 153993 | 10060 | 19696 | 20 | 24 | 5660 | 4942 | 0 | 0 | 610 | 403 | 1350 | 1006 | 810 | 138 | 3910 | 3362 | 50 | 15 | 3900 | 6159 | 18270 | 73561 | 8960 | 27858 | 8300 | 23004 | 8710 | 12170 | 2280 | 1943 | 6520 | 14756 | 10180 | 21012 | 0 | 0 | 0 | 0 | 1630 | 2176 | 17040 | 55190 |
| 90003 | 26180 | 11310 | 5130 | 9640 | 20570 | 64470 | 33760 | 612733 | 26180 | 618848 | 21680 | 518846 | 2410 | 393 | 220 | 100 | 170 | 39 | 1690 | 1339 | 6240 | 59400 | 230 | 275 | 170 | 1528 | 1120 | 16814 | 0 | 1880 | 9002 | 490 | 3204 | 90 | 1173 | 6490 | 6102 | 150 | 39 | 0 | 0 | 30 | 97 | 90 | 262 | 560 | 477 | 160 | 284 | 0 | 0 | 3300 | 57436 | 147488 | 2110 | 4120 | 1000 | 1089 | 2250 | 6165 | 3280 | 13216 | 2290 | 21225 | 2260 | 5650 | 13410 | 196510 | 13410 | 24756 | 20 | 19 | 7510 | 6410 | 0 | 0 | 780 | 497 | 1780 | 1335 | 1060 | 184 | 5210 | 4347 | 50 | 13 | 5620 | 8885 | 25030 | 95254 | 11910 | 36094 | 10940 | 29504 | 11950 | 16777 | 3190 | 2789 | 8680 | 18348 | 13880 | 27416 | 0 | 0 | 0 | 0 | 2670 | 3534 | 23100 | 72074 |
| 90004 | 27360 | 15330 | 7000 | 4670 | 20260 | 51180 | 17800 | 1.61777e+06 | 27360 | 1.64943e+06 | 22010 | 1.02928e+06 | 5270 | 20986 | 2890 | 37375 | 2670 | 26861 | 2650 | 7675 | 7390 | 102468 | 2820 | 83652 | 910 | 18993 | 1660 | 36911 | 0 | 1840 | 11778 | 1310 | 18705 | 1480 | 216013 | 9000 | 31663 | 280 | 69 | 240 | 8545 | 810 | 5683 | 440 | 2188 | 1740 | 1811 | 260 | 562 | 30 | 555 | 5510 | 240787 | 1.07313e+06 | 4470 | 98390 | 860 | 996 | 2910 | 25109 | 5470 | 126876 | 2560 | 45167 | 4390 | 37387 | 18590 | 1.07024e+06 | 18510 | 259534 | 1010 | 10245 | 6420 | 7893 | 990 | 1973 | 620 | 361 | 1680 | 1378 | 990 | 175 | 3010 | 2817 | 50 | 25 | 6600 | 17090 | 25170 | 318535 | 7180 | 14936 | 5470 | 11397 | 4720 | 6107 | 1990 | 1554 | 16120 | 252178 | 20150 | 276712 | 690 | 2166 | 820 | 4768 | 6360 | 32663 | 19660 | 59388 |
| 90005 | 15430 | 8550 | 3870 | 2830 | 11210 | 29910 | 11130 | 707020 | 15430 | 717290 | 12610 | 454410 | 2230 | 5575 | 1000 | 5358 | 910 | 3785 | 990 | 2305 | 3950 | 46369 | 1080 | 19080 | 270 | 5008 | 590 | 11898 | 0 | 830 | 5514 | 450 | 5601 | 650 | 112194 | 4900 | 10270 | 120 | 29 | 50 | 1631 | 250 | 1436 | 220 | 1064 | 990 | 990 | 150 | 367 | 0 | 0 | 2170 | 91544 | 387528 | 1620 | 37572 | 460 | 514 | 1060 | 8951 | 2170 | 47820 | 980 | 16705 | 1580 | 10994 | 9720 | 433364 | 9700 | 99692 | 260 | 2993 | 3240 | 3165 | 260 | 264 | 300 | 193 | 920 | 752 | 470 | 87 | 1700 | 1480 | 0 | 0 | 3550 | 7585 | 14110 | 125690 | 4140 | 8192 | 3160 | 6289 | 3220 | 4378 | 1110 | 881 | 8210 | 96559 | 10540 | 105653 | 180 | 545 | 190 | 713 | 3550 | 14752 | 11260 | 28426 |
| 90006 | 22630 | 11470 | 5400 | 5630 | 17840 | 47590 | 20210 | 563530 | 22630 | 571157 | 18360 | 466220 | 2130 | 841 | 560 | 949 | 490 | 631 | 980 | 1084 | 5730 | 54327 | 610 | 3191 | 230 | 1827 | 650 | 7700 | 0 | 1220 | 6570 | 470 | 3388 | 340 | 11002 | 6500 | 7628 | 120 | 28 | 0 | 0 | 130 | 402 | 190 | 898 | 920 | 910 | 130 | 305 | 0 | 0 | 2040 | 41866 | 130856 | 1380 | 5958 | 550 | 636 | 970 | 3901 | 2020 | 11393 | 940 | 11218 | 1400 | 4235 | 12680 | 233059 | 12640 | 33292 | 70 | 260 | 4980 | 4065 | 100 | 12 | 390 | 247 | 1310 | 1031 | 760 | 147 | 3080 | 2517 | 0 | 0 | 5320 | 8272 | 20900 | 72585 | 7800 | 18122 | 6230 | 14213 | 6600 | 8904 | 1780 | 1391 | 9890 | 29220 | 14110 | 37674 | 0 | 0 | 20 | 82 | 4350 | 6859 | 17660 | 41538 |
| 90007 | 11710 | 6350 | 2270 | 3020 | 8310 | 23380 | 9950 | 311779 | 11710 | 315581 | 9890 | 264552 | 1200 | 402 | 350 | 906 | 290 | 497 | 720 | 752 | 2490 | 22035 | 350 | 1824 | 130 | 1109 | 550 | 7843 | 0 | 800 | 4268 | 270 | 2251 | 110 | 3516 | 3090 | 3800 | 130 | 33 | 0 | 0 | 50 | 143 | 80 | 239 | 570 | 501 | 200 | 475 | 0 | 0 | 1380 | 26294 | 86932 | 990 | 3815 | 300 | 381 | 670 | 2407 | 1360 | 7221 | 660 | 7198 | 1010 | 3391 | 6950 | 142055 | 6930 | 20792 | 50 | 204 | 2990 | 2600 | 60 | 4 | 280 | 180 | 950 | 744 | 510 | 94 | 1700 | 1527 | 0 | 0 | 2220 | 3477 | 11030 | 42879 | 4110 | 9965 | 3570 | 8111 | 3170 | 4055 | 1210 | 974 | 5400 | 18163 | 7210 | 21810 | 0 | 0 | 0 | 0 | 1740 | 2815 | 9700 | 24024 |
| 90008 | 14710 | 8060 | 2310 | 4110 | 9990 | 27000 | 10310 | 662036 | 14710 | 668523 | 11380 | 473516 | 3220 | 1794 | 1270 | 3416 | 1090 | 2092 | 2580 | 4148 | 3540 | 26062 | 1070 | 10633 | 790 | 11251 | 2780 | 97603 | 0 | 1220 | 8145 | 1560 | 16870 | 290 | 10822 | 3790 | 6488 | 330 | 83 | 40 | 658 | 180 | 840 | 140 | 492 | 760 | 770 | 140 | 284 | 0 | 0 | 4720 | 123266 | 415395 | 3600 | 20978 | 920 | 1277 | 3300 | 13216 | 4700 | 38053 | 3040 | 39860 | 4140 | 19875 | 10190 | 386426 | 10160 | 67957 | 310 | 1493 | 3630 | 3434 | 190 | 23 | 550 | 337 | 1280 | 1145 | 510 | 82 | 1750 | 1627 | 110 | 64 | 2560 | 4976 | 13970 | 92620 | 4430 | 11786 | 3990 | 9400 | 3130 | 3892 | 1850 | 1619 | 8670 | 64525 | 10560 | 70194 | 170 | 177 | 150 | 179 | 2780 | 10578 | 11500 | 32783 |
| 90010 | 2210 | 1270 | 690 | 210 | 1760 | 3790 | 960 | 314333 | 2210 | 320471 | 1510 | 142280 | 780 | 8813 | 450 | 8701 | 400 | 6573 | 350 | 3145 | 680 | 20562 | 550 | 41200 | 90 | 2144 | 170 | 6049 | 0 | 90 | 632 | 180 | 3090 | 530 | 60900 | 870 | 6148 | 0 | 0 | 30 | 1103 | 170 | 1219 | 70 | 435 | 150 | 161 | 0 | 0 | 0 | 0 | 900 | 70940 | 272867 | 650 | 24978 | 160 | 224 | 600 | 6912 | 880 | 32771 | 520 | 9838 | 650 | 21417 | 1660 | 224561 | 1660 | 63634 | 180 | 2287 | 470 | 1626 | 140 | 902 | 40 | 25 | 100 | 98 | 40 | 9 | 170 | 170 | 0 | 0 | 650 | 3148 | 1840 | 62807 | 360 | 523 | 230 | 335 | 130 | 151 | 110 | 97 | 1540 | 62160 | 1790 | 67475 | 110 | 535 | 150 | 2053 | 820 | 18448 | 1140 | 7191 |
| 90011 | 36670 | 15540 | 8600 | 12390 | 30240 | 95640 | 51260 | 857731 | 36670 | 864961 | 31420 | 746856 | 3120 | 588 | 250 | 158 | 190 | 53 | 1940 | 1560 | 7160 | 70288 | 250 | 2056 | 170 | 1533 | 1150 | 14963 | 0 | 2280 | 10594 | 480 | 3262 | 150 | 1791 | 7780 | 6988 | 200 | 46 | 0 | 0 | 40 | 56 | 80 | 265 | 830 | 690 | 160 | 308 | 0 | 0 | 3800 | 66782 | 177957 | 2350 | 5318 | 1280 | 1332 | 2540 | 7374 | 3790 | 16048 | 2580 | 24554 | 2550 | 5437 | 18630 | 265130 | 18610 | 34509 | 40 | 90 | 10350 | 8588 | 0 | 0 | 940 | 604 | 2160 | 1528 | 1290 | 228 | 7560 | 6161 | 60 | 14 | 6600 | 10310 | 34770 | 123045 | 14220 | 41698 | 12950 | 34874 | 17190 | 25293 | 3450 | 2649 | 11970 | 25835 | 17980 | 36385 | 0 | 0 | 0 | 0 | 3830 | 4957 | 32070 | 92906 |
... (1473 rows omitted)
Your income_by_zipcode table probably has column names like N1 sum, which looks a little weird.
Question 2.2. Relabel the columns in income_by_zipcode to match the labels in income_raw
Hint: Inspect income_raw.labels and income_by_zipcode.labels to find the differences you need to change.
Hint 2: Since there are many columns, it will be easier to relabel each of them by using a for statement. See Chapter 8 of the textbook for details.
Hint 3: You can use the replace method of a string to remove excess content. See lab02 for examples.
Hint 4: To create a new table from an existing table with one label replaced, use relabeled. To change a label in an existing table permanently, use relabel. Both methods take two arguments: the old label and the new label. You can solve this problem with either one, but relabel is simpler.
for x in income_by_zipcode.labels:
income_by_zipcode.relabel(x,x.replace(" sum",""))
income_by_zipcode
| ZIP | N1 | MARS1 | MARS2 | MARS4 | PREP | N2 | NUMDEP | A00100 | N02650 | A02650 | N00200 | A00200 | N00300 | A00300 | N00600 | A00600 | N00650 | A00650 | N00700 | A00700 | N00900 | A00900 | N01000 | A01000 | N01400 | A01400 | N01700 | A01700 | SCHF | N02300 | A02300 | N02500 | A02500 | N26270 | A26270 | N02900 | A02900 | N03220 | A03220 | N03300 | A03300 | N03270 | A03270 | N03150 | A03150 | N03210 | A03210 | N03230 | A03230 | N03240 | A03240 | N04470 | A04470 | A00101 | N18425 | A18425 | N18450 | A18450 | N18500 | A18500 | N18300 | A18300 | N19300 | A19300 | N19700 | A19700 | N04800 | A04800 | N05800 | A05800 | N09600 | A09600 | N07100 | A07100 | N07300 | A07300 | N07180 | A07180 | N07230 | A07230 | N07240 | A07240 | N07220 | A07220 | N07260 | A07260 | N09400 | A09400 | N10600 | A10600 | N59660 | A59660 | N59720 | A59720 | N11070 | A11070 | N10960 | A10960 | N06500 | A06500 | N10300 | A10300 | N85330 | A85330 | N85300 | A85300 | N11901 | A11901 | N11902 | A11902 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 90001 | 20970 | 9010 | 4930 | 6980 | 17360 | 53540 | 28170 | 527457 | 20970 | 531772 | 17740 | 456544 | 2070 | 241 | 170 | 208 | 130 | 68 | 1530 | 1258 | 4220 | 40103 | 190 | 183 | 150 | 1531 | 890 | 11571 | 0 | 1600 | 8137 | 380 | 2338 | 90 | 3903 | 4520 | 4161 | 120 | 32 | 0 | 0 | 0 | 0 | 50 | 152 | 470 | 370 | 110 | 179 | 0 | 0 | 2790 | 49827 | 136004 | 1820 | 4543 | 850 | 1020 | 1910 | 5712 | 2770 | 13037 | 1900 | 17162 | 1950 | 4222 | 11460 | 180741 | 11440 | 24020 | 30 | 124 | 6350 | 5583 | 0 | 0 | 610 | 404 | 1380 | 1030 | 950 | 164 | 4530 | 3897 | 40 | 13 | 3820 | 6064 | 19960 | 77235 | 8480 | 25093 | 7770 | 20943 | 9210 | 13221 | 2050 | 1594 | 7640 | 18417 | 11060 | 24663 | 0 | 0 | 0 | 0 | 2110 | 3164 | 18460 | 56128 |
| 90002 | 18960 | 8230 | 3830 | 6800 | 15120 | 47200 | 24850 | 462823 | 18960 | 467128 | 15910 | 396088 | 1960 | 227 | 170 | 72 | 130 | 30 | 1450 | 1196 | 4310 | 40865 | 170 | 257 | 130 | 985 | 950 | 14167 | 0 | 1450 | 7131 | 360 | 2344 | 30 | 116 | 4610 | 4305 | 130 | 29 | 0 | 0 | 0 | 0 | 60 | 161 | 470 | 411 | 100 | 202 | 0 | 0 | 2910 | 50033 | 133338 | 1880 | 3809 | 900 | 1003 | 2070 | 5698 | 2890 | 12370 | 2100 | 18397 | 2060 | 5197 | 10070 | 153993 | 10060 | 19696 | 20 | 24 | 5660 | 4942 | 0 | 0 | 610 | 403 | 1350 | 1006 | 810 | 138 | 3910 | 3362 | 50 | 15 | 3900 | 6159 | 18270 | 73561 | 8960 | 27858 | 8300 | 23004 | 8710 | 12170 | 2280 | 1943 | 6520 | 14756 | 10180 | 21012 | 0 | 0 | 0 | 0 | 1630 | 2176 | 17040 | 55190 |
| 90003 | 26180 | 11310 | 5130 | 9640 | 20570 | 64470 | 33760 | 612733 | 26180 | 618848 | 21680 | 518846 | 2410 | 393 | 220 | 100 | 170 | 39 | 1690 | 1339 | 6240 | 59400 | 230 | 275 | 170 | 1528 | 1120 | 16814 | 0 | 1880 | 9002 | 490 | 3204 | 90 | 1173 | 6490 | 6102 | 150 | 39 | 0 | 0 | 30 | 97 | 90 | 262 | 560 | 477 | 160 | 284 | 0 | 0 | 3300 | 57436 | 147488 | 2110 | 4120 | 1000 | 1089 | 2250 | 6165 | 3280 | 13216 | 2290 | 21225 | 2260 | 5650 | 13410 | 196510 | 13410 | 24756 | 20 | 19 | 7510 | 6410 | 0 | 0 | 780 | 497 | 1780 | 1335 | 1060 | 184 | 5210 | 4347 | 50 | 13 | 5620 | 8885 | 25030 | 95254 | 11910 | 36094 | 10940 | 29504 | 11950 | 16777 | 3190 | 2789 | 8680 | 18348 | 13880 | 27416 | 0 | 0 | 0 | 0 | 2670 | 3534 | 23100 | 72074 |
| 90004 | 27360 | 15330 | 7000 | 4670 | 20260 | 51180 | 17800 | 1.61777e+06 | 27360 | 1.64943e+06 | 22010 | 1.02928e+06 | 5270 | 20986 | 2890 | 37375 | 2670 | 26861 | 2650 | 7675 | 7390 | 102468 | 2820 | 83652 | 910 | 18993 | 1660 | 36911 | 0 | 1840 | 11778 | 1310 | 18705 | 1480 | 216013 | 9000 | 31663 | 280 | 69 | 240 | 8545 | 810 | 5683 | 440 | 2188 | 1740 | 1811 | 260 | 562 | 30 | 555 | 5510 | 240787 | 1.07313e+06 | 4470 | 98390 | 860 | 996 | 2910 | 25109 | 5470 | 126876 | 2560 | 45167 | 4390 | 37387 | 18590 | 1.07024e+06 | 18510 | 259534 | 1010 | 10245 | 6420 | 7893 | 990 | 1973 | 620 | 361 | 1680 | 1378 | 990 | 175 | 3010 | 2817 | 50 | 25 | 6600 | 17090 | 25170 | 318535 | 7180 | 14936 | 5470 | 11397 | 4720 | 6107 | 1990 | 1554 | 16120 | 252178 | 20150 | 276712 | 690 | 2166 | 820 | 4768 | 6360 | 32663 | 19660 | 59388 |
| 90005 | 15430 | 8550 | 3870 | 2830 | 11210 | 29910 | 11130 | 707020 | 15430 | 717290 | 12610 | 454410 | 2230 | 5575 | 1000 | 5358 | 910 | 3785 | 990 | 2305 | 3950 | 46369 | 1080 | 19080 | 270 | 5008 | 590 | 11898 | 0 | 830 | 5514 | 450 | 5601 | 650 | 112194 | 4900 | 10270 | 120 | 29 | 50 | 1631 | 250 | 1436 | 220 | 1064 | 990 | 990 | 150 | 367 | 0 | 0 | 2170 | 91544 | 387528 | 1620 | 37572 | 460 | 514 | 1060 | 8951 | 2170 | 47820 | 980 | 16705 | 1580 | 10994 | 9720 | 433364 | 9700 | 99692 | 260 | 2993 | 3240 | 3165 | 260 | 264 | 300 | 193 | 920 | 752 | 470 | 87 | 1700 | 1480 | 0 | 0 | 3550 | 7585 | 14110 | 125690 | 4140 | 8192 | 3160 | 6289 | 3220 | 4378 | 1110 | 881 | 8210 | 96559 | 10540 | 105653 | 180 | 545 | 190 | 713 | 3550 | 14752 | 11260 | 28426 |
| 90006 | 22630 | 11470 | 5400 | 5630 | 17840 | 47590 | 20210 | 563530 | 22630 | 571157 | 18360 | 466220 | 2130 | 841 | 560 | 949 | 490 | 631 | 980 | 1084 | 5730 | 54327 | 610 | 3191 | 230 | 1827 | 650 | 7700 | 0 | 1220 | 6570 | 470 | 3388 | 340 | 11002 | 6500 | 7628 | 120 | 28 | 0 | 0 | 130 | 402 | 190 | 898 | 920 | 910 | 130 | 305 | 0 | 0 | 2040 | 41866 | 130856 | 1380 | 5958 | 550 | 636 | 970 | 3901 | 2020 | 11393 | 940 | 11218 | 1400 | 4235 | 12680 | 233059 | 12640 | 33292 | 70 | 260 | 4980 | 4065 | 100 | 12 | 390 | 247 | 1310 | 1031 | 760 | 147 | 3080 | 2517 | 0 | 0 | 5320 | 8272 | 20900 | 72585 | 7800 | 18122 | 6230 | 14213 | 6600 | 8904 | 1780 | 1391 | 9890 | 29220 | 14110 | 37674 | 0 | 0 | 20 | 82 | 4350 | 6859 | 17660 | 41538 |
| 90007 | 11710 | 6350 | 2270 | 3020 | 8310 | 23380 | 9950 | 311779 | 11710 | 315581 | 9890 | 264552 | 1200 | 402 | 350 | 906 | 290 | 497 | 720 | 752 | 2490 | 22035 | 350 | 1824 | 130 | 1109 | 550 | 7843 | 0 | 800 | 4268 | 270 | 2251 | 110 | 3516 | 3090 | 3800 | 130 | 33 | 0 | 0 | 50 | 143 | 80 | 239 | 570 | 501 | 200 | 475 | 0 | 0 | 1380 | 26294 | 86932 | 990 | 3815 | 300 | 381 | 670 | 2407 | 1360 | 7221 | 660 | 7198 | 1010 | 3391 | 6950 | 142055 | 6930 | 20792 | 50 | 204 | 2990 | 2600 | 60 | 4 | 280 | 180 | 950 | 744 | 510 | 94 | 1700 | 1527 | 0 | 0 | 2220 | 3477 | 11030 | 42879 | 4110 | 9965 | 3570 | 8111 | 3170 | 4055 | 1210 | 974 | 5400 | 18163 | 7210 | 21810 | 0 | 0 | 0 | 0 | 1740 | 2815 | 9700 | 24024 |
| 90008 | 14710 | 8060 | 2310 | 4110 | 9990 | 27000 | 10310 | 662036 | 14710 | 668523 | 11380 | 473516 | 3220 | 1794 | 1270 | 3416 | 1090 | 2092 | 2580 | 4148 | 3540 | 26062 | 1070 | 10633 | 790 | 11251 | 2780 | 97603 | 0 | 1220 | 8145 | 1560 | 16870 | 290 | 10822 | 3790 | 6488 | 330 | 83 | 40 | 658 | 180 | 840 | 140 | 492 | 760 | 770 | 140 | 284 | 0 | 0 | 4720 | 123266 | 415395 | 3600 | 20978 | 920 | 1277 | 3300 | 13216 | 4700 | 38053 | 3040 | 39860 | 4140 | 19875 | 10190 | 386426 | 10160 | 67957 | 310 | 1493 | 3630 | 3434 | 190 | 23 | 550 | 337 | 1280 | 1145 | 510 | 82 | 1750 | 1627 | 110 | 64 | 2560 | 4976 | 13970 | 92620 | 4430 | 11786 | 3990 | 9400 | 3130 | 3892 | 1850 | 1619 | 8670 | 64525 | 10560 | 70194 | 170 | 177 | 150 | 179 | 2780 | 10578 | 11500 | 32783 |
| 90010 | 2210 | 1270 | 690 | 210 | 1760 | 3790 | 960 | 314333 | 2210 | 320471 | 1510 | 142280 | 780 | 8813 | 450 | 8701 | 400 | 6573 | 350 | 3145 | 680 | 20562 | 550 | 41200 | 90 | 2144 | 170 | 6049 | 0 | 90 | 632 | 180 | 3090 | 530 | 60900 | 870 | 6148 | 0 | 0 | 30 | 1103 | 170 | 1219 | 70 | 435 | 150 | 161 | 0 | 0 | 0 | 0 | 900 | 70940 | 272867 | 650 | 24978 | 160 | 224 | 600 | 6912 | 880 | 32771 | 520 | 9838 | 650 | 21417 | 1660 | 224561 | 1660 | 63634 | 180 | 2287 | 470 | 1626 | 140 | 902 | 40 | 25 | 100 | 98 | 40 | 9 | 170 | 170 | 0 | 0 | 650 | 3148 | 1840 | 62807 | 360 | 523 | 230 | 335 | 130 | 151 | 110 | 97 | 1540 | 62160 | 1790 | 67475 | 110 | 535 | 150 | 2053 | 820 | 18448 | 1140 | 7191 |
| 90011 | 36670 | 15540 | 8600 | 12390 | 30240 | 95640 | 51260 | 857731 | 36670 | 864961 | 31420 | 746856 | 3120 | 588 | 250 | 158 | 190 | 53 | 1940 | 1560 | 7160 | 70288 | 250 | 2056 | 170 | 1533 | 1150 | 14963 | 0 | 2280 | 10594 | 480 | 3262 | 150 | 1791 | 7780 | 6988 | 200 | 46 | 0 | 0 | 40 | 56 | 80 | 265 | 830 | 690 | 160 | 308 | 0 | 0 | 3800 | 66782 | 177957 | 2350 | 5318 | 1280 | 1332 | 2540 | 7374 | 3790 | 16048 | 2580 | 24554 | 2550 | 5437 | 18630 | 265130 | 18610 | 34509 | 40 | 90 | 10350 | 8588 | 0 | 0 | 940 | 604 | 2160 | 1528 | 1290 | 228 | 7560 | 6161 | 60 | 14 | 6600 | 10310 | 34770 | 123045 | 14220 | 41698 | 12950 | 34874 | 17190 | 25293 | 3450 | 2649 | 11970 | 25835 | 17980 | 36385 | 0 | 0 | 0 | 0 | 3830 | 4957 | 32070 | 92906 |
... (1473 rows omitted)
Question 2.3.
Create a table called income with one row per ZIP code and the following columns.
ZIP column with the same contents as 'ZIP' from income_by_zipcode.num returns column containing the total number of tax returns that include a total income amount (column 'N02650' from income_by_zipcode).total income ($) column containing the total income in all tax returns in thousands of dollars (column 'A02650' from income_by_zipcode).num farmers column containing the number of farmer returns (column 'SCHF' from income_by_zipcode).income = Table().with_columns(
"ZIP", income_by_zipcode["ZIP"],
"num returns", income_by_zipcode["N02650"],
"total income ($)", income_by_zipcode["A02650"],
"num farmers", income_by_zipcode["SCHF"]
)
income.set_format('total income ($)', NumberFormatter(0)).show(5)
| ZIP | num returns | total income ($) | num farmers |
|---|---|---|---|
| 90001 | 20970 | 531,772 | 0 |
| 90002 | 18960 | 467,128 | 0 |
| 90003 | 26180 | 618,848 | 0 |
| 90004 | 27360 | 1,649,431 | 0 |
| 90005 | 15430 | 717,290 | 0 |
... (1478 rows omitted)
Question 2.4. All ZIP codes with less than 100 returns (or some other special conditions) are grouped together into one ZIP code with a special code. Remove the row for that ZIP code from the income table.
Hint 1: This ZIP code value has far more returns than any of the other ZIP codes. Try using group and sort to find it.
Hint 2: To remove a row in the income table using where, assign income to the smaller table using the following expression structure:
income = income.where(...)
Hint 3: Each ZIP code is represented as a string, not an int.
removeZIP = income.sort("num returns", descending=True)["ZIP"].item(0)
income = income.where("ZIP", are.not_containing(removeZIP))
income
| ZIP | num returns | total income ($) | num farmers |
|---|---|---|---|
| 90001 | 20970 | 531,772 | 0 |
| 90002 | 18960 | 467,128 | 0 |
| 90003 | 26180 | 618,848 | 0 |
| 90004 | 27360 | 1,649,431 | 0 |
| 90005 | 15430 | 717,290 | 0 |
| 90006 | 22630 | 571,157 | 0 |
| 90007 | 11710 | 315,581 | 0 |
| 90008 | 14710 | 668,523 | 0 |
| 90010 | 2210 | 320,471 | 0 |
| 90011 | 36670 | 864,961 | 0 |
... (1472 rows omitted)
Because each ZIP code has a different number of people, computing the average income across several ZIP codes requires some care. This will come up several times in this project. Here is a simple example:
Question 2.5 Among all the tax returns that
what is the average total income? Express the answer in dollars as an int rounded to the nearest dollar.
# total income of the poeple living in 94576
total_income94576 = income.where("ZIP", are.equal_to("94576"))["total income ($)"]
# tax return of the poeple living in 94576
tax_return94576 = income.where("ZIP", are.equal_to("94576"))["num returns"]
# total income of the people living in 94704
total_income94704 = income.where("ZIP", are.equal_to("94704"))["total income ($)"]
# tax return of the poeple living in 94704
tax_return94704 = income.where("ZIP", are.equal_to("94704"))["num returns"]
average_income = round(sum(total_income94576+total_income94704)*1000/sum(tax_return94576+tax_return94704))
average_income
52773
Question 2.6. Among all California tax returns that include a total income amount, what is the average total income? Express the answer in dollars as an int rounded to the nearest dollar.
avg_total = round(sum(income["total income ($)"])*1000/sum(income["num returns"]))
avg_total
72791
Farms use water, so it's plausible that farming is an important factor in water usage. Here, we will check for a relationship between farming and income.
Among the tax returns in California for ZIP codes represented in the income table, is there an association between income and living in a ZIP code with a lot of farmers?
We'll try to answer the question in 3 ways.
Question 2.7. Make a scatter plot with one point for each ZIP code. Display the average income in dollars on the vertical axis and the proportion of returns that are from farmers on the horizontal axis.
# make a table containing the average income in dollars for each Zip code
avg_income = income.with_columns("average income", income["total income ($)"]*1000/income["num returns"],
"proportion of farmers", income["num farmers"]/income["num returns"])
avg_income.scatter("proportion of farmers", "average income")
Question 2.8. From the graph, can you say whether ZIP codes with more farmers typically have lower or higher average income than ZIP codes with few or no farmers? Can you say how much lower or higher?
The range in the average income is greater in the areas with less proportion of farmers than the areas with greater proportion of farmers. The areas with the greater proportion of farmers have a steady average incomes. We cannot say that the ZIP codes with more farmers typically have lower or higher average income because the range in the average income is great in the areas with low number of farmers.
Question 2.9. Compare the average incomes for two groups of tax returns: those in ZIP codes with a greater-than-average proportion of farmers and those in ZIP codes with a less-than-average (or average) proportion. Make sure both of these values are displayed (preferably in a table). Then, describe your findings.
Hint: Make sure your result correctly accounts for the number of tax returns in each ZIP code, as in questions 2.5 and 2.6.
# find the average proportion of farmers
farmer_avg_total = sum(income["num farmers"])/sum(income["num returns"])
# add a column of the average proportion of farmers in the table
income_farmers = income.with_column("average farmers", income["num farmers"]/income["num returns"])
# table with a greater-than-average proportion of farmers
farmers_greater = income_farmers.where("average farmers", are.above(farmer_avg_total))
# get the average incomes of tax returns in dollars rounded to the nearest dollar
income_avg_greater = round(sum(farmers_greater["total income ($)"])*1000/sum(farmers_greater["num returns"]))
# table with a less-than-average proportion of farmers
farmers_less = income_farmers.where("average farmers", are.below_or_equal_to(farmer_avg_total))
# get the average incomes of tax returns in dollars rounded to the nearest dollar
income_avg_less = round(sum(farmers_less["total income ($)"])*1000/sum(farmers_less["num returns"]))
# Build and display a table with two rows:
# 1) incomes of returns in ZIP codes with a greater-than-average proportion of farmers
# 2) incomes of returns in other ZIP codes
avg_income_farmers = Table().with_columns("avg farmer proportion", ["above-average", "below-average"],
"average income", [income_avg_greater, income_avg_less])
avg_income_farmers
| avg farmer proportion | average income |
|---|---|
| above-average | 78231 |
| below-average | 71464 |
The areas with the average number of farmers greater than the average proportion have the higher average income than the areas with the average number of farmers below the average proportion.
Question 2.10. The graph below displays two histograms: the distribution of average incomes of ZIP codes that have above-average proportions of farmers, and that of ZIP codes with below-average proportions of farmers.
Are ZIP codes with below-average proportions of farmers more or less likely to have very low incomes? Explain how your answer is consistent with your answer to question 2.8.
The ZIP codes with below-average proportions of farmers more likely to have very low incomes than the ZIP codes with above-average proportions of farmers. However, the ZIP codes with below-average proportions of farmers have more likely to have high average incomes too. We can also see this from the graph in question 2.7. In the graph, the range of the average incomes are very large when the proportion of farmers are small, which is consistent with the graph in question 2.10.
ZIP codes cover all the land in California and do not overlap. Here's a map of all of them.
Question 2.11. Among the ZIP codes represented in the income table, is there an association between high average income and some aspect of the ZIP code's location? If so, describe one aspect of the location that is clearly associated with high income.
Answer the question by drawing a map of all ZIP codes that have an average income above 100,000 dollars. Then, describe an association that you observe.
In order to create a map of certain ZIP codes, you need to:
high_average_zips.high_average_zips with the zip_features table to find the region for each ZIP code of interest.Map(...) on the column of features (provided).# Write code to draw a map of only the high-income ZIP codes.
# We have filled in some of it and suggested names for variables
# you might want to define.
zip_features = Table.from_records(zips.features)
high_average_zips = avg_income.where("average income", are.above(100000))
high_zips_with_region = zip_features.join('ZIP', high_average_zips)
Map(high_zips_with_region.column('feature'), width=400, height=300)
Most of the ZIP codes with high income are gathered together near Los Angeles, San Hose, and San Francisco. These cities are located near coast and known to others.
We will now investigate water usage in California. The usage table contains three columns:
PWSID: The Public Water Supply Identifier of the districtPopulation: Estimate of average population served in 2015Water: Average residential water use (gallons per person per day) in 2014-2015# Run this cell to create the usage table.
usage_raw.set_format(4, NumberFormatter)
max_pop = usage_raw.select(0, 1, 'population').group(0, max).relabeled(2, 'Population')
avg_water = usage_raw.select(0, 'res_gpcd').group(0, np.mean).relabeled(1, 'Water')
usage = max_pop.join('pwsid', avg_water).relabeled(0, 'PWSID').relabeled(1, 'District')
usage
| PWSID | District | Population | Water |
|---|---|---|---|
| 0110001 | Alameda County Water District | 340000 | 70.7 |
| 0110003 | California Water Service Company Livermore | 57450 | 90.2727 |
| 0110005 | East Bay Municipal Utilities District | 1390000 | 76 |
| 0110006 | Hayward, City of | 151037 | 57.1818 |
| 0110008 | Pleasanton, City of | 73067 | 96.6364 |
| 0110009 | Dublin San Ramon Services District | 79547 | 68.6364 |
| 0110011 | Livermore, City of | 31994 | 85.8182 |
| 0310003 | Amador Water Agency | 23347 | 82.8182 |
| 0410002 | California Water Service Company Chico District | 101447 | 142 |
| 0410005 | California Water Service Company Oroville | 11208 | 88.8182 |
... (401 rows omitted)
Question 3.1. Draw a map of the water districts, colored by the per capita water usage in each district.
Use the districts.color(...) method to generate the map. It takes as its first argument a two-column table with one row per district that has the district PWSID as its first column. The label of the second column is used in the legend of the map, and the values are used to color each region.
# We have filled in the call to districts.color(...). Set per_capita_usage
# to an appropriate table so that a map of all the water districts is
# displayed.
per_capita_usage = usage.select("PWSID","Water")
districts.color(per_capita_usage, key_on='feature.properties.PWSID')
Question 3.2. Based on the map above, which part of California appears to use more water per person: the San Francisco area or the Los Angeles area?
The Los Angeles area use more water per person. There are more areas with the darker color in the Los Angeles area compared to the San Francisco area.
Next, we will try to match each ZIP code with a water district. ZIP code boundaries do not always line up with water districts, and one water district often covers multiple ZIP codes, so this process is imprecise. It is even the case that some water districts overlap each other. Nonetheless, we can continue our analysis by matching each ZIP code to the water district with the largest geographic overlap.
The table wd_vs_zip describes the proportion of land in each ZIP code that is contained in each water district and vice versa. (The proportions are approximate because they do not correctly account for discontiguous districts, but they're mostly accurate.)
wd_vs_zip.show(5)
| PWSID | ZIP | District in ZIP | ZIP in District |
|---|---|---|---|
| 0110001 | 94536 | 9.41% | 68.51% |
| 0110001 | 94538 | 18.87% | 67.31% |
| 0110001 | 94539 | 13.13% | 44.36% |
| 0110005 | 94541 | 1.61% | 68.11% |
| 0110006 | 94541 | 18.68% | 98.46% |
... (3201 rows omitted)
Question 3.3. Complete the district_for_zip function that takes a ZIP code as its argument. It returns the PWSID with the largest value of ZIP in District for that zip_code, if that value is at least 50%. Otherwise, it returns the string 'No District'.
def district_for_zip(zip_code):
zip_code = str(zip_code) # Ensure that the ZIP code is a string, not an integer
districts = wd_vs_zip.where("ZIP", are.equal_to(zip_code)).sort("ZIP in District", descending=True)
at_least_half = districts.column("ZIP in District").item(0)
if at_least_half >= 0.5:
return districts["PWSID"].item(0)
else:
return 'No District'
district_for_zip(94709)
'0110005'
This function can be used to associate each ZIP code in the income table with a PWSID and discard ZIP codes that do not lie (mostly) in a water district.
zip_pwsids = income.apply(district_for_zip, 'ZIP')
income_with_pwsid = income.with_column('PWSID', zip_pwsids).where('PWSID', are.not_equal_to("No District"))
income_with_pwsid.set_format(2, NumberFormatter(0)).show(5)
| ZIP | num returns | total income ($) | num farmers | PWSID |
|---|---|---|---|---|
| 90001 | 20970 | 531,772 | 0 | 1910067 |
| 90022 | 26680 | 767,484 | 0 | 1910036 |
| 90024 | 14690 | 4,395,487 | 20 | 1910067 |
| 90025 | 25110 | 4,019,082 | 20 | 1910067 |
| 90034 | 29950 | 1,828,572 | 0 | 1910067 |
... (662 rows omitted)
Question 3.4. Create a table called district_data with one row per PWSID and the following columns:
PWSID: The ID of the districtPopulation: Population estimateWater: Average residential water use (gallons per person per day) in 2014-2015Income: Average income in dollars of all tax returns in ZIP codes that are (mostly) contained in the district according to income_with_pwsid.Hint: First create a district_income table that sums the incomes and returns for ZIP codes in each water district.
district_income = income_with_pwsid.group("PWSID", np.sum).drop(1,4)
## add a column with the average income in dollars of all tax returns
district_income = district_income.with_column(
"average income", district_income["total income ($) sum"]*1000/district_income["num returns sum"])
district_income = district_income.drop(1,2)
district_data = usage.join("PWSID", district_income).relabeled(4, "Income")
district_data
| PWSID | District | Population | Water | Income |
|---|---|---|---|---|
| 0110001 | Alameda County Water District | 340000 | 70.7 | 79032 |
| 0110005 | East Bay Municipal Utilities District | 1390000 | 76 | 82497.2 |
| 0110006 | Hayward, City of | 151037 | 57.1818 | 52923.5 |
| 0110008 | Pleasanton, City of | 73067 | 96.6364 | 163257 |
| 0110009 | Dublin San Ramon Services District | 79547 | 68.6364 | 133902 |
| 0410002 | California Water Service Company Chico District | 101447 | 142 | 50400.8 |
| 0410006 | South Feather Water and Power Agency | 18300 | 286.2 | 38720.5 |
| 0410011 | Del Oro Water Company | 9615 | 92.1818 | 44706.7 |
| 0710001 | Antioch, City of | 106455 | 110.273 | 53550.8 |
| 0710003 | Contra Costa Water District | 197536 | 101.636 | 73913.7 |
... (200 rows omitted)
Question 3.5. The bay_districts table gives the names of all water districts in the San Francisco Bay Area. Is there an association between water usage and income among Bay Area water districts? Use the tables you have created to compare water usage between the 10 Bay Area water districts with the highest average income and the rest of the Bay Area districts, then describe the association. Do not include any districts in your analysis for which you do not have income information.
The names below are just suggestions; you may perform the analysis in any way you wish.
Note: Some Bay Area water districts may not appear in your district_data table. That's ok. Perform your analysis only on the subset of districts where you have both water usage & income information.
bay_districts = Table.read_table('~/DS_113_S23/Projects/Project_1/bay_districts.csv')
## find the Bay Area districts
bay_district_data = district_data.where("District", are.contained_in(bay_districts[0]))
## sort the district_data table for the top average income
bay_district_data = bay_district_data.sort("Income", descending=True)
bay_district_data.show(10)
| PWSID | District | Population | Water | Income |
|---|---|---|---|---|
| 4110006 | California Water Service Company Bear Gulch | 58895 | 170.455 | 1.16049e+06 |
| 4310001 | California Water Service Company Los Altos/Suburban | 68163 | 124.545 | 349183 |
| 4310009 | Palo Alto, City of | 64403 | 89.1818 | 334057 |
| 4110017 | Menlo Park, City of | 16066 | 75.1818 | 278733 |
| 2110002 | Marin Municipal Water District | 188200 | 85.6364 | 176643 |
| 0110008 | Pleasanton, City of | 73067 | 96.6364 | 163257 |
| 4310007 | Mountain View, City of | 76781 | 65.5455 | 138570 |
| 4110021 | Estero Municipal Improvement District | 37165 | 66.3636 | 137893 |
| 4110001 | Mid-Peninsula Water District | 26730 | 83.7273 | 137537 |
| 4110008 | California Water Service Company Mid Peninsula | 135918 | 70.3636 | 135967 |
... (19 rows omitted)
Complete this one-sentence conclusion: In the Bay Area, people in the top 10 highest-income water districts used an average of 24.645 more gallons of water per person per day than people in the rest of the districts.
# the average of water usage in the top 10 highest-income water districts
top10_avg_water = (sum(bay_district_data["Water"][0:10]))/10
# the average of water usage in the rest of the water districts
rest_avg_water = sum(bay_district_data["Water"][10:len(bay_district_data["Water"])])/19
difference = top10_avg_water-rest_avg_water
difference
24.645454545454541
Question 3.6. In one paragraph, summarize what you have discovered through the analyses in this project and suggest what analysis should be conducted next to better understand California water usage, income, and geography. What additional data would be helpful in performing this next analysis?
When we see the maps, both the Los Angeles areas and San Francisco areas had the higher average income compared to other areas, but the Los Angeles areas have a higher water usage than the San Francisco areas. However, it is significant that the average income has a positive correlation with the water usage. For the better analysis, it would be good to note the number of farmers for each area, since the farmers use more water than the others and the areas with the higher average income usually did not had a lot of farmers.
Congratulations - you've finished Project 1 of Data 8!
To submit:
Run All from the Cell menu to ensure that you have executed all cells, including the test cells. Make sure that the visualizations you create are actually displayed.Download as HTML (.html) from the File menu. (Sometimes that seems to fail. If it does, you can download as HTML, open the .html file in your browser, and print it to a PDF.)If you want, draw some more maps below.
# Your extensions here (completely optional)